Algorithmic statistics revisited

Nikolay Vereshchagin    Alexander Shen


Abstract 

The mission of statistics is to provide adequate statistical hypotheses (models) 
for observed data. But what is an “adequate” model? To answer this question, 
one needs to use the notions of algorithmic information theory. It turns out that 
for every data string x one can naturally define a “stochasticity profile”, a curve
that represents a trade-off between the complexity of a model and its adequacy. This
curve has four different equivalent definitions in terms of (1) randomness deficiency,
(2) minimal description length, (3) position in the lists of simple strings
and (4) Kolmogorov complexity with decompression time bounded by the busy beaver
function. We present a survey of the corresponding definitions and results relating
them to each other. 


1 What is algorithmic statistics? 

The laws of celestial mechanics allow astronomers to predict the observed motion
of planets in the sky with very high precision. This was a great achievement of modern
science, but could we expect to find equally precise models for all other observations?
Probably not. Thousands of gamblers spent all their lives and their fortunes trying to
discover the laws of the roulette (coin tossing, other games of chance) in the same
sense, but failed. Modern science abandoned these attempts. It says modestly that all
we can say about coin tossing is the statistical hypothesis (model): all trials are
independent and (for a fair coin) both heads and tails have probability 1/2. The task of
mathematical statistics therefore is to find an appropriate model for experimental data.
But what is “appropriate” in this context?

To simplify the discussion, let us assume that experimental data are presented as 
a bit string (say, a sequence of zeros and ones corresponding to heads and tails in the 
coin tossing experiment). We also assume that a model is presented as a probability 
distribution on some finite set of binary strings. For example, a fair coin hypothesis
for N coin tossings is the set of all strings of length N where all elements have the same
probability $2^{-N}$. Restricting ourselves to the simplest case when a hypothesis is some
set A of strings with uniform distribution on it, we repeat our question:

Assume that a bit string x (data) and a set A containing x (a model) are 
given; when do we consider A as a good “explanation” for x?

Some examples show that this question cannot be answered in the framework of
classical mathematical statistics. Consider a sequence x of 100 bits (the following
example is derived from the random tables [20]):

01111 10001 11110 10010 00001 00011 00001 10010 00010 11101 
10111 11110 10000 11100 00111 00000 01111 01100 11011 01011

Probably you would agree that the statistical hypothesis of a fair coin (the set A =
$\mathbb{B}^{100}$ of all 100-bit sequences) looks like an adequate explanation for this
sequence. On the other hand, you probably will not accept the set A as a good explanation
for the sequence y:

00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 
00000 00000 00000 00000 00000 00000 00000 00000 00000 00000 

but will suggest a much better explanation B = {y} (the coin that always gives heads).
On the other hand, the set C = {x} does not look like a reasonable explanation for x. How
can we justify this intuition?

One could say that A is not an acceptable statistical hypothesis for y since the
probability of y according to A is negligible ($2^{-100}$). However, the probability of x
for this hypothesis is the same, so why is A acceptable for x then? And if B looks like an
acceptable explanation for y, why does C not look like an acceptable explanation for x?

Classical statistics, where x and y are just two equiprobable elements of A, cannot
answer these questions. Informally, the difference is that x looks like a “random”
element of A while y is “very special”. To capture this difference, we need to use the
basic notion of algorithmic information theory, Kolmogorov complexity (see footnote 1),
and say that x has high complexity (cannot be described by a program that is much shorter
than x itself) while y has small complexity (one can write a short program that prints a
long sequence of zeros). This answers our first question and explains why A could be a good
model for x but not for y.

Another question we asked: why is B an acceptable explanation for y while C is not
an acceptable explanation for x? Here we need to look at the complexity of the model 
itself: C has high complexity (because x is complex) while B is simple (because y is 
simple). 
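The contrast between x and y can be illustrated with an off-the-shelf compressor. The sketch below is only a rough illustration: Kolmogorov complexity is not computable, and the length of a zlib-compressed string is merely an upper bound on the description length for one particular (far from optimal) decompressor. The helper name compressed_bits is ours and does not come from the paper.

```python
import zlib


def compressed_bits(bits: str) -> int:
    """Length in bits of the zlib-compressed ASCII form of a 0/1 string.

    This is a crude, computable stand-in for complexity (an assumption of
    this sketch), not Kolmogorov complexity itself.
    """
    return 8 * len(zlib.compress(bits.encode("ascii"), 9))


# The two 100-bit sequences from the text.
x = (
    "01111 10001 11110 10010 00001 00011 00001 10010 00010 11101 "
    "10111 11110 10000 11100 00111 00000 01111 01100 11011 01011"
).replace(" ", "")
y = "0" * 100

print("x:", compressed_bits(x), "bits after compression")
print("y:", compressed_bits(y), "bits after compression")
```

On such short inputs the fixed overhead of the zlib format is a large part of the output, so only the comparison between the two values is meaningful, not the values themselves.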

Now let us consider different approaches to measuring the “quality” of statistical
models; they include several parameters and a trade-off between them arises. In this
way for every data string x we get a curve that reflects this trade-off. There are different
ways to introduce this curve, but they are all equivalent with O(log n) precision for n-bit
strings. The goal of this paper is to describe these approaches and equivalence results.


2 $(\alpha, \beta)$-stochastic objects

Let us start with the approach that most closely follows the scheme described above. 
Let x be a string and let A be a finite set of strings that contains x. The “quality” of A 
as a model (explanation) for x is measured by two parameters: 

1 We assume that the reader is familiar with basic notions of algorithmic information theory and use them 
freely. For a short introduction see EH; more information can be found in ED 
• the Kolmogorov complexity C(A) of A;

• the randomness deficiency d(x | A) of x in A.

The second parameter measures how “non-typical” x is in A (small values mean that x
looks like a typical element of A) and is defined as

$d(x \mid A) = \log \#A - C(x \mid A).$

Here log stands for the binary logarithm, #A is the cardinality of A and C(u | v) is the
conditional complexity of u given v. Using A as the condition, we assume that A is
presented as a finite list of strings (say, in lexicographical ordering). The motivation
for this definition: for all $x \in A$ we have $C(x \mid A) \le \log \#A + O(1)$, since every
$x \in A$ is determined by its ordinal number in A; for most $x \in A$ the complexity
C(x | A) is close to log #A, since the number of strings whose complexity is much less than
log #A is negligible compared to #A. So the deficiency is large for strings that are much
simpler than most elements of A (see footnote 2).
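The counting argument behind the last claim can be spelled out as follows. This is a sketch in the notation above; the threshold d is an illustrative parameter introduced here. For any real k there are fewer than $2^{k+1}$ programs of length less than k, so fewer than $2^{k+1}$ strings x can have $C(x \mid A) < k$. Taking $k = \log \#A - d$ gives:

```latex
\#\{x \in A : d(x \mid A) > d\}
  \;=\; \#\{x \in A : C(x \mid A) < \log \#A - d\}
  \;<\; 2^{\log \#A - d + 1}
  \;=\; 2^{1-d} \cdot \#A ,
\qquad\text{hence}\qquad
\frac{\#\{x \in A : d(x \mid A) > d\}}{\#A} \;<\; 2^{1-d}.
```

So the fraction of elements of A with deficiency above d decreases exponentially in d.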

According to this approach, a good explanation A for x should make both parameters
small: A should be simple and x should be typical in A. It may happen that
these two goals cannot be achieved simultaneously, and a trade-off arises. Following
Kolmogorov, we say that x is $(\alpha, \beta)$-stochastic if there exists A containing x such
that $C(A) \le \alpha$ and $d(x \mid A) \le \beta$. In this way we get an upward closed set

$S(x) = \{(\alpha, \beta) \mid x \text{ is } (\alpha, \beta)\text{-stochastic}\}.$

If x is a string of length n, the set A of all n-bit strings can be used as a description; it
gives us the pair $(O(\log n),\ n - C(x) + O(\log n))$ in S(x). Indeed, we can describe A
using O(log n) bits, and the deficiency is
$n - C(x \mid A) = n - C(x \mid n) = n - C(x) + O(\log n)$.
On the other hand, there is a set $A \ni x$ of complexity C(x) + O(1) and deficiency O(1)
(namely, A = {x}). So the boundary of the set S(x) starts below the point $(0,\ n - C(x))$
and decreases to (C(x), 0) for an arbitrary n-bit string x, if we consider S(x) with
O(log n) precision (see footnote 3).

The boundary line of S(x) can be called a stochasticity profile of x. As we will see, 
the same curve appears in several other situations. 

3 Minimum description length principle 

Another way to measure the “quality” of a model starts from the following observation: 
if x is an element of a finite set A, then x can be described by providing two pieces of 
information: 

2 There is an alternative definition of d(x | A). Consider a function t of two arguments x and A, defined
when $x \in A$, and having integer values. We say that t is lower semicomputable if there is an algorithm that
(given x and A) generates lower bounds for t(x, A) that converge to the true value of t(x, A) in the limit.
We say that t is a probability-bounded test if for every A and every positive integer k the fraction of $x \in A$
such that t(x, A) > k is at most 1/k. Now d(x | A) can be defined as the logarithm of the maximal (up to
O(1)-factor) lower semicomputable probability-bounded test.

3 As is usual in algorithmic information theory, we consider the complexities up to O(log n) precision
if we deal with strings of length at most n. Two subsets $S, T \subset \mathbb{Z}^2$ are the same for us if S is
contained in the O(log n)-neighborhood of T and vice versa.
• the description of A;

• the ordinal number of x in A (with respect to some ordering fixed in advance).

This gives us the inequality

$C(x) \le C(A) + \log \#A$

that is true with precision O(log n) for strings x of length at most n (see footnote 4).

The quality of the hypothesis A is then measured by the difference

$\delta(x, A) = C(A) + \log \#A - C(x)$

between the sides of this inequality. We may call it the “optimality deficiency” of A, since
it shows how much we lose in the length of the description if we consider a two-part
description based on A instead of the best possible one. For a given string x we can
then consider the set O(x) of pairs $(\alpha, \beta)$ such that x has a model of complexity at
most $\alpha$ and optimality deficiency at most $\beta$.
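The two-part description can be sketched numerically. In the sketch below the model complexity C(A) is replaced by a crude, hand-chosen number of bits needed to write down the parameters of the model. This substitution, the helper name two_part_length and the chosen bit counts are assumptions made for illustration only. The second part, log #A, is computed exactly.

```python
import math


def two_part_length(model_bits: float, set_size: int) -> float:
    """Stand-in for C(A) + log #A: bits for the model plus bits for the index of x in A."""
    return model_bits + math.log2(set_size)


n = 100
y = "0" * n          # the all-zeros string from Section 1
k = y.count("1")     # number of ones in y (here 0)

# Model 1: all n-bit strings; only n has to be specified (about log n bits).
all_strings = two_part_length(math.log2(n), 2 ** n)

# Model 2: n-bit strings with exactly k ones; both n and k have to be specified.
fixed_weight = two_part_length(2 * math.log2(n), math.comb(n, k))

print(f"all n-bit strings:   {all_strings:.1f} bits")
print(f"strings with k ones: {fixed_weight:.1f} bits")
```

For y the second model gives a much shorter two-part description, i.e. a smaller optimality deficiency. For a string with about n/2 ones the two lengths would differ by only a few bits, so the extra parameter k would buy nothing.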

Theorem 1 For every string x of length at most n the sets S(x) and O(x) coincide
with O(log n)-precision: each of them is contained in the O(log n)-neighborhood of the
other one.

Speaking about neighborhoods, we assume some standard distance on $\mathbb{R}^2$ (the exact
choice does not matter, since we measure the distance up to a constant factor).

Let us note now that in one direction the inclusion is straightforward. A simple
computation shows that the randomness deficiency never exceeds the optimality
deficiency of the same model (with logarithmic precision the difference between them
equals C(A | x), where A is this model).
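This computation can be sketched as follows, assuming the symmetry of information (Kolmogorov-Levin) formula and, as in footnote 4, that C(A) is at most n, so that all error terms are O(log n):

```latex
\begin{align*}
\delta(x, A) - d(x \mid A)
  &= \bigl(C(A) + \log \#A - C(x)\bigr) - \bigl(\log \#A - C(x \mid A)\bigr) \\
  &= C(A) + C(x \mid A) - C(x) \\
  &= C(x, A) - C(x) + O(\log n) \\
  &= C(A \mid x) + O(\log n).
\end{align*}
```

Since $C(A \mid x) \ge 0$, this gives $d(x \mid A) \le \delta(x, A) + O(\log n)$, which is the easy inclusion.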

The opposite direction is more complicated: a model with small randomness
deficiency may have large optimality deficiency. This may happen when C(A | x) is
large (see footnote 5). However, in this case we can find another model and decrease the
optimality deficiency as needed: for every string x and every model A for x (a finite set A
that contains x) there exists another model A' for x such that $\log \#A' = \log \#A$ and
$C(A') \le C(A) - C(A \mid x) + O(\log n)$, where n is the length of x. This result looks
surprising at first, but note that if C(A | x) is large, then there are many sets A' that are
models of the same quality (otherwise A can be reconstructed from x by exhaustive
search). These sets can be used to find A' with required properties.

The definition of the set O(x) goes back to Kolmogorov [10]; however, he used a
slightly different definition: instead of O(x) he considered the function

$h_x(\alpha) = \min_A \{\log \#A : x \in A,\ C(A) \le \alpha\},$

4 The additional term O(log C(A)) should appear in the right hand side, since we need to specify where
the description of A ends and the ordinal number of x starts, so the length of the description (C(A)) should
be specified in advance using some self-delimiting encoding. One may assume that $C(A) \le n$, otherwise the
inequality is trivial, so this additional term is O(log n).

5 Let x and y be independent random strings of length n, so the pair (x, y) has complexity close to 2n.
Assume that x starts with 0 and y starts with 1. Let A be the set of strings that start with 0, plus the string
y. Then A, considered as a model for x, has large optimality deficiency but small randomness deficiency. To
decrease the optimality deficiency, we may remove y from A.

Figure 1: The pair $(\alpha, \beta)$ lies on the boundary of O(x) since the point $(\alpha,\ C(x) - \alpha + \beta)$
lies on the graph of $h_x$.


now called the Kolmogorov structure function. Both O(x) and $h_x$ are determined by the set
of all pairs (C(A), log #A) for finite sets A containing x, though in a slightly different
way (since the inequality $\delta(x, A) \le \beta$ in the definition of O(x) combines C(A) and
log #A). One can show, however, that the following statement is true with O(log n)-precision
for each n-bit string x: the pair $(\alpha, \beta)$ is in O(x) if and only if
$h_x(\alpha) \le \beta + C(x) - \alpha$. So the graph of $h_x$ is just the boundary of O(x) in
different coordinates.


4 Lists of simple strings 

We have seen two approaches that describe the same trade-off between the complexity
of a model and its quality: for every x there is some curve (defined up to O(log n)-precision)
that shows how good a model of bounded complexity can be. Both approaches
gave the same curve with logarithmic precision; in this section we give one
more description of the same curve.

Let m be some integer. Consider the list of strings of complexity at most m. It can 
be generated by a simple algorithm: just try in parallel all programs of length at most 
m and enumerate all their outputs (without repetitions). This algorithm is simple (of 
complexity O(log m)) since we only need to know m.
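The parallel ("dovetailed") run of all short programs can be sketched on a toy example. Everything specific below is an assumption made for illustration: toy_run is a hypothetical, non-optimal decompressor invented for this sketch, and max_rounds is an artificial cut-off (a real enumeration never knows when the list is complete, since some programs do not halt).

```python
from itertools import product
from typing import Iterator, List, Optional


def toy_run(program: str) -> Iterator[Optional[str]]:
    """Hypothetical toy decompressor: yields None once per step, then the output.

    '0' + w   prints w literally after len(w) steps;
    '10' + w  prints int(w, 2) zeros after that many steps;
    any other program never halts.
    """
    if program.startswith("0"):
        for _ in range(len(program) - 1):
            yield None
        yield program[1:]
    elif program.startswith("10"):
        count = int(program[2:] or "0", 2)
        for _ in range(count):
            yield None
        yield "0" * count
    else:
        while True:
            yield None


def enumerate_simple_strings(m: int, max_rounds: int) -> List[str]:
    """Run all programs of length at most m in parallel, one step each per round,
    and list their outputs in order of appearance, without repetitions."""
    programs = ["".join(bits)
                for length in range(1, m + 1)
                for bits in product("01", repeat=length)]
    running = {p: toy_run(p) for p in programs}
    seen, order = set(), []
    for _ in range(max_rounds):
        for p in list(running):
            result = next(running[p])
            if result is not None:      # program p has halted with this output
                del running[p]
                if result not in seen:
                    seen.add(result)
                    order.append(result)
    return order


listing = enumerate_simple_strings(3, max_rounds=10)
print(listing)
```

A different scheduling of the same programs would produce the same strings in a different order, which is exactly the situation considered next.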

There may be several simple algorithms that enumerate all strings of complexity at
most m, and they can generate them in different orders. For example, two algorithms
may start by listing all the strings of length m - O(1) (they all have complexity at most
m), but one does this in alphabetical order and the other uses reverse alphabetical
order. So the string 00...00 is the first in one list and has number $2^{m - O(1)}$ in the other.
But the distance from the end of the list is much more invariant:

Theorem 2 Consider two programs of complexity O(log m) that both enumerate all
strings of complexity at most m. Let x be one of these strings. If there are at least $2^k$
strings after x in the first list, then there are at least $2^{k - O(\log m)}$ strings after x in the
second list.

Figure 2: To find how many strings appear after x in the list of all strings of complexity
at most m, we draw a line starting at (0, m) with slope -1 and intersect it with the graph
of $h_x$; if the second coordinate of the intersection point is k, there are about $2^k$ strings
after x in this list.


In this theorem we consider two algorithms that enumerate the same strings in
different orderings. However, the Kolmogorov complexity function depends on the
choice of the optimal decompressor (though at most by an O(1) additive term), so one
could ask what happens if we enumerate the strings of bounded complexity for two
different versions of the complexity function. A similar result (with a similar proof) says
that the change of the optimal decompressor used to define Kolmogorov complexity can
be compensated by an O(log m)-change in the threshold m.

Now for every m fix an algorithm of complexity at most O(log m) that enumerates
all strings of complexity at most m. Consider a binary string x; it appears in these lists
for all $m \ge C(x)$. Consider the logarithm of the number of strings that follow x in the
m-th list. We get a function that is defined for all $m \ge C(x)$ with O(log m) precision.
The following result shows that this function describes the stochasticity profile of x in
different coordinates.

Theorem 3 Let x be a string of length at most n.

(a) Assume that x appears in the list of strings of complexity at most m and there
are at least $2^k$ strings after x in the list. Then the pair $((m - k) + O(\log n),\ m - C(x))$
belongs to the set O(x).

(b) Assume that the pair $(m - k,\ m - C(x))$ belongs to O(x). Then x appears in
the list of strings of complexity at most m + O(log n) and there are at least $2^{k - O(\log n)}$
strings after it.

By Theorem 1 the same statement holds for the set S(x) in place of O(x).

Ignoring the logarithmic correction and taking into account the relation between
O(x) and $h_x$, one can illustrate the statement of Theorem 3 by Figure 2.

5 Time-bounded complexity and busy beavers


There is one more way to get the stochasticity profile curve. Let us bound the computation
time (number of steps) in the definition of Kolmogorov complexity and define
$C^t(x)$ as the minimal length of a program that produces x in at most t steps. Evidently,
$C^t(x)$ decreases as t increases, and ultimately reaches C(x) (see footnote 6). However, the
convergence speed may be quite different for different x of the same complexity. It is possible
that for some x the programs of minimal length produce x rather fast, while other x
can be compressed only if we allow very long computations. Informally, the strings
of the first type have some simple internal structure that allows us to encode them
efficiently with a fast decoding algorithm, while the strings of the second type have “deep”
internal structure that is visible only if the observer has a lot of computational power.
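The dependence of time-bounded complexity on the time bound can be sketched with a toy decompressor. The language below is hypothetical and not optimal; its two instruction formats, the running times assigned to them and the search limit max_len are all assumptions chosen so that the shorter program is the slower one.

```python
from itertools import product
from typing import Optional, Tuple


def toy_decompress(program: str) -> Optional[Tuple[str, int]]:
    """Hypothetical toy language: return (output, running time), or None if no halt.

    '0' + w   prints w literally in len(w) steps;
    '10' + w  prints c = int(w, 2) zeros, but needs c * c steps.
    """
    if program.startswith("0"):
        w = program[1:]
        return w, len(w)
    if program.startswith("10"):
        count = int(program[2:] or "0", 2)
        return "0" * count, count * count
    return None


def time_bounded_complexity(x: str, t: int, max_len: int) -> Optional[int]:
    """Toy analogue of C^t(x): length of a shortest program (of length <= max_len)
    that outputs x within t steps; None if there is no such program."""
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            result = toy_decompress("".join(bits))
            if result is not None and result[0] == x and result[1] <= t:
                return length
    return None


x = "0" * 8
for t in (7, 8, 64):
    print(t, time_bounded_complexity(x, t, max_len=10))
```

In this toy setting the value is undefined for t = 7, equals 9 for t = 8 (only the literal program is fast enough) and drops to 6 once t reaches 64, when the short but slow program becomes admissible.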

We use the so-called “busy beaver numbers” as landmarks for measuring the computation
time. Let BB(n) be the maximal running time of all terminating programs of length at most
n (we use the programming language that defines Kolmogorov complexity, and some
fixed interpreter for it; see footnote 7). One can show that the numbers BB(n) have an
equivalent definition in terms of Kolmogorov complexity: BB(n) is the maximal integer that
has complexity at most n. (More precisely, if B(n) is the maximal integer that has complexity
at most n, then $B(n - c) \le BB(n) \le B(n + c)$ for some c and all n, and we ignore
O(1)-changes in the argument of the busy beaver function.)

Now for every x we may consider the decreasing function $i \mapsto C^{BB(i)}(x) - C(x)$ (it
decreases fast for “shallow” x and slowly for “deep” x; note that it becomes close to 0
when i = C(x), since then every terminating program of length at most C(x) terminates
within BB(C(x)) steps). The graph of this function is (with logarithmic precision) just a
stochasticity profile, i.e., the set of points above the graph coincides with O(x) up to an
O(log n) error term:

Theorem 4 Let x be a string of length n.

(a) If a pair $(\alpha, \beta)$ is in O(x), then

$C^{BB(\alpha + O(\log n))}(x) \le C(x) + \beta + O(\log n).$

(b) If $C^{BB(\alpha)}(x) \le C(x) + \beta$, then the pair $(\alpha + O(\log n),\ \beta + O(\log n))$ is in O(x).

By Theorem 1 the same statement holds for the set S(x) in place of O(x).

6 One may ask which computational model is used to measure the computation time, and complain that
the notion of time-bounded complexity may depend on the choice of an optimal programming language
(decompressor) and its interpreter. Indeed this is the case, but we will use a very rough measure of computation
time based on the busy beaver function, and the difference between computational models does not matter. The
reader may assume that we fix some optimal programming language, and some interpreter (say, a Turing
machine) for this language, and count the steps performed by this interpreter.

7 Usually the n-th busy beaver number is defined as the maximal running time or the maximal number of
non-empty cells that can appear after a Turing machine with at most n states terminates starting on the empty tape.
This gives a different number; we modify the definition so it does not depend on the peculiarities of encoding
information by transition tables of Turing machines.

6 What can the stochasticity profile look like?


We have seen four different definitions that lead to the same (with logarithmic precision)
notion of stochasticity profile. We see now that finite objects (strings) not only
could have different complexities, but also the strings with the same complexity can be
classified according to their stochasticity profiles.

However, we do not know yet that this classification is non-trivial: what if all strings 
of given complexity have the same stochasticity profile? The following result answers 
this question by showing that every simple decreasing function appears as complexity 
profile of some string. 

Theorem 5 Assume that some integers n and k < n are given, and h is a non-increasing
function mapping {0, 1, ..., k} to {0, 1, ..., n - k}. Then there exists a string x of length
n + O(log n) + O(C(h)) and complexity k + O(log n) + O(C(h)) for which the set O(x)
(and hence the set S(x)) coincides with the upper-graph of h (the set
$\{(i, j) \mid j \ge h(i) \text{ or } i \ge k\}$) with O(log n + C(h)) accuracy.

Note that the error term depends on the complexity of h. If we consider simple
functions h, this term is absorbed by our standard error term O(log n). In particular,
this happens in two extreme cases: for the function h = 0 and the function h that is
equal to n - k everywhere. In the first case it is easy to find such a “shallow” x: just
take an incompressible string of length k and add n - k trailing zeros to get an n-bit
string. For the second case we do not know a better example than the one obtained
from the proof of Theorem 5.

Let us say informally that a string x of length n is “stochastic” if its stochasticity
profile S(x) is close to the maximal possible set (achieved by the first example) with
logarithmic precision, i.e., x is (O(log n), O(log n))-stochastic. We know now that
non-stochastic objects exist in the mathematical sense; a philosopher could ask whether they
appear in “real life”. Is it possible that some experiment gives us data that do not
have any adequate statistical model? This question is quite philosophical since, having
an object and a model, we cannot say for sure whether the model is adequate in terms
of algorithmic statistics. For example, the current belief is that coin tossing data are
described adequately by a fair coin model. Still it is possible that future scientists will
discover some regularities in the very same data, thus making this model unsuitable.

We discuss the properties of stochastic objects in Section 8. For now let
us note only that this notion remains essentially the same if we consider probability
distributions (and not finite sets) as models. Let us explain what this means.

Consider a probability distribution P on a finite set of strings with rational values.
It is a constructive object, so we can define the complexity of P using some computable
encoding. The conditional complexity C(· | P) can be defined in the same way. Let us
modify the definition of stochasticity and say that a string x is “$(\alpha, \beta)$-p-stochastic” if
there exists a distribution P of the described type such that

• C(P) is at most $\alpha$;

• d(x | P), defined as $-\log P(x) - C(x \mid P)$, does not exceed $\beta$.

This is indeed a generalization: if P is a uniform distribution, then the complexity of P
is (up to O(1)) the complexity of its support A, the value of $-\log P(x)$ is log #A, and
using P and A as conditions gives the same complexity up to O(1). On the other hand,
this generalization leads only to a logarithmic change in the parameters:

Theorem 6 If some string x of length n is $(\alpha, \beta)$-p-stochastic, then the string x is also
$(\alpha + O(\log n),\ \beta + O(\log n))$-stochastic.

Since all our statements are made with O(log n)-precision, we may identify stochasticity
with p-stochasticity (as we do in the sequel).

7 Canonical models 

Let $\Omega_m$ denote the number of strings of complexity at most m. Consider its binary
representation, i.e., the sum

$\Omega_m = 2^{s_1} + 2^{s_2} + \dots + 2^{s_t}$, where $s_1 > s_2 > \dots > s_t$.

According to this decomposition, we may split the list itself into groups: the first $2^{s_1}$
elements, the next $2^{s_2}$ elements, etc. (see footnote 8). If x is a string of complexity at
most m, it belongs to some group, and this group can be considered as a model for x.

We may consider different values of m (starting from C(x)). In this way we get
different models of this type for the same x. Let us denote by $B_{m,s}$ the group of size
$2^s$ that appears in the m-th list. Note that $B_{m,s}$ is defined only for s that correspond to
ones in the binary representation of $\Omega_m$. The models $B_{m,s}$ are called canonical models
in the sequel. The parameters of $B_{m,s}$ are easy to compute: the size is $2^s$ by definition,
and the complexity is m - s + O(log m).
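The splitting of the list into groups can be sketched as follows. The function name canonical_groups and the placeholder strings are ours; the input stands for an already completed enumeration, which in reality can only be approximated from below, since the full list of strings of complexity at most m is not computable.

```python
from typing import Dict, List, Sequence


def canonical_groups(enumeration: Sequence[str]) -> Dict[int, List[str]]:
    """Split a list of length Omega into consecutive groups of sizes 2^s,
    one group for each 1 in the binary representation of Omega,
    taken from the largest power of two to the smallest."""
    total = len(enumeration)
    groups: Dict[int, List[str]] = {}
    start = 0
    for s in range(total.bit_length() - 1, -1, -1):
        if (total >> s) & 1:
            size = 1 << s
            groups[s] = list(enumeration[start:start + size])
            start += size
    return groups


# A placeholder list with 11 = 8 + 2 + 1 elements (binary 1011).
listing = [f"string{i}" for i in range(11)]
for s, group in canonical_groups(listing).items():
    print(s, len(group), group[0], "...", group[-1])
```

Each element of the list falls into exactly one group, and the key s of that group determines the size $2^s$ of the corresponding model.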

Theorem 7 (a) Every canonical model for a string x lies on the boundary of O(x) (i.e.,
its parameters cannot be improved by more than O(log n), where n is the length of x).

(b) For every point in O(x) there exists a canonical model that has the same or
better parameters (with O(log n) precision).

The second part of this theorem says that for every model A for x we can find a
canonical model $B_{m,s}$ that has the same (or smaller) optimality deficiency, and
$C(B_{m,s}) \le C(A)$ with logarithmic precision. In fact, the second part of this statement can be
strengthened: not only $C(B_{m,s}) \le C(A)$, but also $C(B_{m,s} \mid A) = O(\log n)$.

This result shows that (in a sense) we may restrict ourselves to canonical models.
This raises the question: what are these models? What information do they contain? The
answer is a bit confusing: the information in the models $B_{m,s}$ depends on m - s only and
is the same as the information in $\Omega_{m-s}$, the number of strings of complexity at most
m - s:

Theorem 8 For all models $B_{m,s}$ both conditional complexities $C(B_{m,s} \mid \Omega_{m-s})$ and
$C(\Omega_{m-s} \mid B_{m,s})$ are O(log m).

8 We assume that an algorithm is fixed that, given m, enumerates all strings of complexity at most m in 
some order. 

One could note also that the information in $\Omega_k$ is a part of the information in $\Omega_l$ for
l > k (i.e., $C(\Omega_k \mid \Omega_l) = O(\log l)$; see footnote 9).

Now it seems that finding a good model for x does not provide any specific
information about x: all we get (if we succeed) is the information about the number of
terminating programs of bounded length, which has nothing to do with x and is
the same for all x.

It is not clear how this philosophical collision between our goals and our achievements
can be resolved. One of the approaches is to consider total conditional complexity.
This approach still leaves many questions open, but let us briefly describe it
nevertheless.

We have said that “strings a and b contain essentially the same information” if both
C(a | b) and C(b | a) are small. This, however, does not guarantee that the properties
of a and b are the same. For example, if x* is the shortest program for some highly
non-stochastic string x, the string x* itself is perfectly stochastic.

To avoid this problem, we can consider the total conditional complexity CT(a | b), defined
as the minimal length of a total program p such that p(b) = a. Here p is called total if
p(b') halts for all b', not only for b (see footnote 10). This total conditional complexity can be
much bigger than the standard conditional complexity C(a | b). It has the following property:
if both CT(a | b) and CT(b | a) are negligible, there exists a computable permutation of
low complexity that maps b to a, and therefore the sets O(a) and O(b) are close to each
other. (See 03 for more details.) 

Using this notion, we may consider a set A as a “strong” model if it is close to the
boundary of O(x) and at the same time the total complexity CT(A | x) is small. The
second condition is far from trivial: one can prove that for some strings x such strong
models do not exist at all (except for the trivial model {x} and the models of very
small complexity) [27]. But if strong models exist, they have some nice properties:
for example, the stochasticity profile of every strong sufficient statistic for x is close
to the profile of the string x itself [26]. (A model is called a sufficient statistic for x if
the optimality deficiency is small, i.e., the sum of its complexity and log-cardinality is
close to C(x).) The class of all sufficient statistics for x does not have this property (for
some x).

Returning to the stochasticity profile, let us mention one more non-existence result.
Imagine that we want to find a place where the set O(x) touches the horizontal coordinate
line. To formulate a specific task, consider for a given string x of length n two
numbers. The first, $\alpha_1$, is the maximal value of $\alpha$ such that $(\alpha, 0.1n)$ does not belong
to O(x); the second, $\alpha_2$, is the minimal value of $\alpha$ such that $(\alpha, 10 \log n)$ belongs to
O(x). (Of course, the constant 10 is chosen just to avoid additional quantifiers; any
sufficiently large constant would work.) Imagine that we want, given x and C(x), to
find some point in the interval $(\alpha_1, \alpha_2)$, or even in a slightly bigger one (say, adding
a margin of size 0.1n in both directions). One can prove that there is no algorithm
that fulfills this task [24].

9 In fact, $\Omega_k$ contains the same information (up to O(log k) conditional complexity in both directions) as
the first k bits of Chaitin’s Ω-number (a lower semicomputable random real), so we use the same letter Ω to
denote it.

10 As usual, we assume that the programming language is optimal, i.e., gives an O(1)-minimal value of the
complexity compared to other languages.

8 Stochastic objects


The philosophical questions about non-stochastic objects in the “real world” motivate
several mathematical questions. Where do they come from? Can we obtain a
non-stochastic object by applying some (simple) algorithmic transformation to a stochastic
one? Can non-stochastic objects appear (with non-negligible probability) in a (simple)
random process? What are the special properties of non-stochastic objects?

Here are several results answering these questions. 

Theorem 9 Let f be a computable total function. If a string x of length n is
$(\alpha, \beta)$-stochastic, then f(x) is $(\alpha + C(f) + O(\log n),\ \beta + C(f) + O(\log n))$-stochastic.

Here C(f) is the complexity of the program that computes f.

An important example: let f be the projection function that maps every pair (x, y) (its
encoding) to x. Then we have C(f) = O(1), so we conclude that each component of
an $(\alpha, \beta)$-stochastic pair is $(\alpha + O(\log n),\ \beta + O(\log n))$-stochastic.

A philosopher would interpret Theorem 9 as follows: a non-stochastic object cannot
appear in a simple total algorithmic process (unless the input was already
non-stochastic). Note that the condition of totality is crucial here: for every x, stochastic
or not, we may consider its shortest program p. It is incompressible (and therefore
stochastic), and x is obtained from p by a simple program (decompressor).

If a non-stochastic object cannot be obtained by a (simple total) algorithmic
transformation from a stochastic one, can it be obtained (with non-negligible probability) in
a (simple computable) random process? If P is a simple distribution on a finite set of
strings with rational values, then P can be used as a statistical model, so only objects
x with high randomness deficiency d(x | P) can be non-stochastic, and the set of all x
that have d(x | P) greater than some d has negligible P-probability (an almost direct
consequence of the deficiency definition).

So for computable probability distributions the answer is negative for trivial
reasons. In fact, a much stronger (and surprising) statement is true. Consider a probabilistic
machine M without input that, being started, produces some string and terminates,
or does not terminate at all (and produces nothing). Such a machine determines a
semimeasure on the set of strings (we do not call it a measure since the sum of probabilities
of all strings may be less than 1 if the machine hangs with positive probability). The
following theorem says that a (simple) machine of this type produces non-stochastic
objects with negligible probability.

Theorem 10 There exists some constant c such that the probability of the event 

“M produces a string of length at most n that is not 
(d + C(M) + c log n, c log n)-stochastic” 

is bounded by 2^{−d} for every machine M of the described type and for arbitrary integers n 
and d. 

The following results partially explain why this happens. Recall that algorithmic 
information theory defines the mutual information in two strings x and y as
C(x) + C(y) − C(x,y); with O(log n) precision (for strings of length at most n) this
expression coincides with C(x) − C(x|y) and C(y) − C(y|x). Recall that by Ω_n we denote
the number of strings of complexity at most n. 

Theorem 11 There exists a constant c such that for every n, for every string x of length 
at most n and for every threshold d the following holds: if x is not 
(d + c log n, c log n)-stochastic, then 

I(x : Ω_n) > d − c log n. 

This theorem says that all non-stochastic objects have a lot of information about a 
specific object, the string Ω_n. This explains why they have small probability to appear 
in a (simple) randomized process, as the following result shows. It guarantees that for 
every fixed string w the probability to get (in a simple random process) some object 
that contains significant information about w is negligible. 

Theorem 12 There exists a constant c such that for every n, for every probabilistic
machine M, for every string w of length at most n and for every threshold d the probability 
of the event 

“M outputs a string x of length at most n such that I(x : w) > C(M) + d + c log n” 

is at most 2^{−d}. 

The last result of this section shows that stochastic objects are “representative” if 
we are interested only in the complexity of strings and their combinations: for every 
tuple of strings one can find a stochastic tuple that is indistinguishable from the first 
one by complexities of its components. 

Theorem 13 For every k there exists a constant c such that for every n and for every 
k-tuple (x_1, ..., x_k) of strings of length at most n, there exists another k-tuple (y_1, ..., y_k) 
that is (c log n, c log n)-stochastic and for every I ⊂ {1,2,...,k} the difference between 
C(x_I) and C(y_I) is at most c log n. 

Here x_I is the tuple made of the strings x_i with i ∈ I; the same for y_I. 

This result implies, for example, that every linear inequality for complexities that 
is true for stochastic tuples, is true for arbitrary ones. 

However, there are some results that are known for stochastic tuples but still are not 
proven for arbitrary ones. See m for details. 

9 Restricted classes: Hamming balls as descriptions 

Up to now we considered arbitrary sets as statistical models. However, sometimes we 
have some external information that suggests a specific class of models (and it remains 
to choose the parameters that define some model in this class). For example, if the data 
string is a message sent through a noisy channel that can change some bits, we consider 
Hamming balls as models, and the parameters are the center of this ball (the original 
message) and its radius (the maximal number of changed bits). 

So let us consider some family ℬ of finite sets. To get a reasonable theory, we need 
to assume some properties of this family: 


• The family is computably enumerable: there exists an algorithm that enumerates
all elements of ℬ (finite sets are here considered as finite objects, encoded 
as lists of their elements). 

• For each n the set of all n-bit strings belongs to ℬ. 

• There exists a polynomial p such that the following property holds: for every 
B ∈ ℬ, for every positive integer n and for every c < #B the set of all n-bit 
strings in B can be covered by p(n)#B/c sets from ℬ and each of the covering 
sets has cardinality at most c. 

Here #B stands for the cardinality of B. A counting argument shows that in the last 
condition we need at least #B/c covering sets; the condition says that a polynomial
overhead is enough here. 

One can show (using simple probabilistic arguments) that the family of all Hamming
balls (for all string lengths, centers and radii) has all three properties. This family 
is the main motivating example for our considerations. 
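
The covering property can be explored by brute force for very short strings. The
following is only an illustrative sketch, not the probabilistic argument mentioned
above: it covers one Hamming ball greedily by smaller Hamming balls and compares the
number of sets used with the counting lower bound #B/c. The function names and the
parameters n = 6, radius 2 and c = 7 are assumptions chosen for the illustration.

```python
from itertools import product
from math import ceil


def hamming(u, v):
    """Number of positions where equal-length bit tuples u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def ball(center, radius):
    """All bit tuples of the same length within distance `radius` of `center`."""
    n = len(center)
    return frozenset(
        w for w in product((0, 1), repeat=n) if hamming(w, center) <= radius
    )


def greedy_cover(target, max_size, n):
    """Greedily cover `target` by Hamming balls of cardinality <= max_size (>= 1)."""
    # Candidate covering sets: every ball in {0,1}^n that is small enough.
    candidates = [
        b
        for c in product((0, 1), repeat=n)
        for r in range(n + 1)
        if len(b := ball(c, r)) <= max_size
    ]
    uncovered, cover = set(target), []
    while uncovered:
        # Singletons are always candidates, so each step covers a new element.
        best = max(candidates, key=lambda b: len(b & uncovered))
        cover.append(best)
        uncovered -= best
    return cover


n = 6
B = ball((0,) * n, 2)   # 1 + 6 + 15 = 22 strings
c = 7                   # balls of radius 1 in {0,1}^6 have 7 elements
cover = greedy_cover(B, c, n)
print("#B =", len(B), "lower bound =", ceil(len(B) / c), "greedy cover =", len(cover))
```

The greedy procedure gives no polynomial guarantee by itself; it only shows, on a
toy instance, how far an actual cover can be from the bound #B/c.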

Now we can define the notion of a ℬ-(α, β)-stochastic object: a string x is
ℬ-(α, β)-stochastic if there exists a set B ∈ ℬ containing x such that C(B) ≤ α and 
d(x|B) ≤ β. (The original notion of (α, β)-stochasticity corresponds to the case when 
ℬ contains all finite sets.) For every x we get a set S_ℬ(x) of pairs (α, β) for which x 
is ℬ-(α, β)-stochastic. We can also define the set O_ℬ(x) using optimality deficiency 
instead of randomness deficiency. The ℬ-version of Theorem 1 is still true (though the 
proof needs a much more ingenious construction): 

Theorem 14 Let ℬ be a family of finite sets that has the properties listed above. 
Then for every string x of length at most n the sets S_ℬ(x) and O_ℬ(x) coincide up 
to an O(log n) error term. 

The proof is more difficult (compared to the proof of Theorem 1) since we now 
need to consider sets in ℬ instead of arbitrary finite sets. So we cannot construct the 
required model for a given string x ourselves and have to select it among the given sets 
that cover x. This can be done by a game-theoretic argument. 

It is interesting to note that a similar argument can be used to obtain the following 
result about stochastic finite sets (Epstein-Levin theorem): 

Theorem 15 If a finite set X is (α, β)-stochastic and the total probability 

Σ_{x∈X} 2^{−K(x)} 

of its elements exceeds 2^{−k}, then X contains some element x such that 

K(x) ≤ k + K(k) + log K(k) + α + O(log β) + O(1). 

Here K(u) stands for the prefix complexity of u (see, e.g., [15] for the definition). 
To understand the meaning of this theorem, let us recall one of the fundamental results 
of algorithmic information theory: the (prefix) complexity of a string x equals
(up to an additive constant) the negative binary logarithm of its a priori probability.
If we consider a set X of strings instead of one string x, we can consider the a priori
probability of X (expressing how difficult it is to get some element of X in a random
process) and the minimal complexity of elements of X (saying how difficult it is to
specify an individual element in X). The fundamental result mentioned above says that
for singletons these two measures are closely related; for arbitrary finite sets this is
no longer the case, but Theorem 15 guarantees it for stochastic finite sets. 

Returning to our main topic, let us note that for Hamming balls the boundary curve 
of O_ℬ(x) has a natural interpretation. To cover x of length n with a ball B with center y 
having cardinality 2^β and complexity at most α means (with logarithmic precision) to 
find a string y of complexity at most α in the r-neighborhood of x, where r is chosen 
in such a way that balls of radius r have about 2^β elements. So this boundary curve 
represents a trade-off between the complexity of y and its distance to x. 
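
The correspondence between the log-cardinality of a ball and its radius is easy to
compute. A minimal sketch follows; the function names are hypothetical, and the
convention of taking the largest radius whose ball has at most 2^β elements is an
assumption made here (any choice within logarithmic precision would do).

```python
from math import comb, log2


def ball_size(n, r):
    """Cardinality of a Hamming ball of radius r in the set of n-bit strings."""
    return sum(comb(n, i) for i in range(min(r, n) + 1))


def radius_for_log_size(n, beta):
    """Largest radius r such that log2 of the ball size is at most beta (beta >= 0)."""
    r = 0
    while r < n and log2(ball_size(n, r + 1)) <= beta:
        r += 1
    return r


# For 10-bit strings: ball sizes for radii 0..3 are 1, 11, 56, 176.
print([ball_size(10, r) for r in range(4)])
print(radius_for_log_size(10, 5.9))
```

With such a table, a point (α, β) of the boundary curve translates into the
question whether some string of complexity at most α lies within distance
radius_for_log_size(n, β) of x.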

Again one can ask what kind of boundary curves may appear. As in Theorem 5 
we can get an essentially arbitrary non-increasing function. However, here the precision is 
worse: the O(log n) term is now replaced by O(√(n log n)). 

Theorem 16 Assume that some integers n and k ≤ n are given, and h is a non-increasing 
function mapping {0,1,...,k} to {0,1,...,n − k}. Then there exists a string x of length 
n + O(√(n log n)) + O(C(h)) and complexity k + O(√(n log n)) + O(C(h)) for which the 
set O_ℬ(x) coincides with the upper-graph of h (the set {(i,j) | j ≥ h(i) or i ≥ k}) with 
O(√(n log n) + C(h))-precision. 

Unlike the general case where non-stochastic objects (for which the curve is far 
from zero) exist but are difficult to describe, for the case of Hamming balls one can 
give more explicit examples. Consider some explicit error correction code that has 
distance d. Then every string that differs in less than d/2 positions from some codeword 
x has almost the same complexity as x (since x can be reconstructed from it by error 
correction). So the balls of radius less than d/2 containing some codeword have almost 
the same complexity as the codeword itself (and the balls of zero radius containing it). 

Let x be a typical codeword of this binary code (its complexity is close to the 
logarithm of the number of codewords). For values of α slightly less than C(x) we 
need a large β (at least the logarithm of the cardinality of a ball of radius d/2) to make 
such a codeword (α, β)-stochastic. 
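
The reconstruction step in this argument can be illustrated with a toy code. The
sketch below uses the 3-fold repetition code (distance 3), which is an assumption
made only for the illustration and not a code considered in this survey; it checks
that a codeword is recovered from any string that differs from it in less than d/2
positions, i.e., in at most one position.

```python
from itertools import product

REP = 3  # each message bit is repeated REP times; the code distance is REP


def encode(message):
    """Codeword of the repetition code for a list of message bits."""
    return [b for b in message for _ in range(REP)]


def decode(word):
    """Nearest codeword, found by majority vote inside each block of REP bits."""
    out = []
    for i in range(0, len(word), REP):
        block = word[i:i + REP]
        bit = 1 if 2 * sum(block) > REP else 0
        out.extend([bit] * REP)
    return out


def flip(word, positions):
    """Copy of `word` with the bits at the given positions inverted."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(word)]


# Every single-bit corruption of every codeword decodes back to the codeword.
recovered = all(
    decode(flip(encode(list(m)), {i})) == encode(list(m))
    for m in product((0, 1), repeat=4)
    for i in range(4 * REP)
)
print(recovered)
```

Since the corrupted string determines the codeword through a fixed simple
procedure, its complexity cannot be much smaller than that of the codeword.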


10 Historical remarks 

The notion of (α, β)-stochasticity was mentioned by Kolmogorov in his talks at the 
seminar he initiated at the Moscow State University in the early 1980s (see [22]). The 
equivalence between this notion and the optimality deficiency (Theorem 1) was
discovered in [24]. 

The connection between the existence of adequate models and the position in the 
list of strings of bounded complexity was discovered by Gacs, Tromp and Vitanyi 
in 0, though this paper considered only the position of x in the list of strings of
complexity at most C(x). Theorems 2 and 3 appeared in [24]. The paper 0 considered 
also the canonical models (called “nearly sufficient statistics” in this paper) for the case 
m = C(x). In the general case the canonical models were considered in [24] (section 
V, Realizing the structure function), where Theorems 7 and 8 were proven. 

The minimal description length principle goes back to Rissanen ED; as he wrote in 
this paper, “If we work with a fixed family of models, (...) the cost of the complexity 
of a model may be taken as the number of bits it takes to describe its parameters. 
Clearly now, when adding new parameters to the model, we must balance their own 
cost against the reduction they permit in the ideal code length, −log P(x|θ), and we 
get the desired effect in a most natural manner. If we denote the total number of bits 
required to encode the parameters θ by L(θ), then we can write the total code length as 
L(x,θ) = −log P(x|θ) + L(θ), which we seek to minimize over θ”. The set denoted by 
O(x) in our survey was considered in 1974 by Kolmogorov (see B1 01 ); later it appeared 
in the literature also under the names of “sophistication” and “snooping curves”. 

The notion of sophistication was introduced by Koppel in [12]. Let β be a natural 
number; the β-sophistication of a string x is the minimal length of a total program p such 
that there is a string y with p(y) = x and |p| + |y| ≤ C(x) + β. In our terms p defines 
a model that consists of all p(y) for all strings y of a given length. It is not hard to see 
that with logarithmic precision we get the same notion: the β-sophistication of x is at 
most α if and only if the pair (α, β) is in the set O(x). 

The notion of snooping curve L_x(α) of x was introduced by V’yugin in [31]. In this 
paper he considered strategies that read a bit sequence from left to right and for each 
next bit provide a prediction (a rational-valued probability distribution on the set {0,1} 
of possible outcomes). After the next bit appears, the loss is computed depending on 
the prediction and the actual outcome. The goal of the predictor is to minimize the total 
loss, i.e., the sum of losses at all n stages (for an n-bit sequence). V’yugin considered 
different loss functions, and for one of them, called the logarithmic loss function, we get a 
notion equivalent to O(x). For the logarithmic loss function, we account for loss −log p 
if the predicted probability of the actual outcome was p. It is easy to see that for a given 
x the following statement is true (with logarithmic precision): there exists a strategy of 
complexity at most α with loss at most l if and only if l ≥ h_x(α). (Indeed, prediction 
strategies are just bit-by-bit representations of probability distributions on the set of n-bit 
strings, in terms of conditional probabilities.) 
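
The parenthetical remark can be checked numerically: the total logarithmic loss of
a prediction strategy on x equals −log of the probability that the induced
distribution assigns to x. The sketch below uses Laplace's rule of succession as
the strategy; this particular predictor and the function names are assumptions
made for illustration and do not come from V’yugin's paper.

```python
from fractions import Fraction
from itertools import product
from math import log2


def laplace_prediction(prefix):
    """Predicted probability that the next bit is 1 (Laplace's rule of succession)."""
    return Fraction(sum(prefix) + 1, len(prefix) + 2)


def induced_probability(bits, predict=laplace_prediction):
    """Probability of the whole sequence: the product of conditional probabilities."""
    prob = Fraction(1)
    for i, b in enumerate(bits):
        p1 = predict(bits[:i])
        prob *= p1 if b == 1 else 1 - p1
    return prob


def total_log_loss(bits, predict=laplace_prediction):
    """Sum over all stages of -log2(predicted probability of the actual bit)."""
    loss = 0.0
    for i, b in enumerate(bits):
        p1 = predict(bits[:i])
        loss += -log2(float(p1 if b == 1 else 1 - p1))
    return loss


x = [0, 1, 1, 0]
print(induced_probability(x))                        # 1/30
print(total_log_loss(x), -log2(float(induced_probability(x))))
```

Summing induced_probability over all sequences of a fixed length gives 1, so a
strategy indeed defines a probability distribution on n-bit strings.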

Theorem [4] (Section 0 is due to Bauwens 0. The idea to consider the difference 
between time bounded complexity of x and the unbounded one goes back to Chaitin E). 
Later the subject was studied by Bennett who introduced the notion of logical depth: 
the depth of x at significance level β is the minimal time t such that C^t(x) ≤ C(x) + β. 
The string is called (β, t)-deep if its depth at significance level β is larger than t. A 
closely related notion of computational depth was introduced in (TJ: the computational 
depth of x with time bound t is C^t(x) − C(x). Obviously, the computational depth of x 
with time bound t is more than β if and only if x is (β, t)-deep. Theorem 4 relates 
both notions of depth to the stochasticity profile (with logarithmic precision): a string 
is (β, B(α))-deep if and only if the pair (α, β) is outside the set O(x). 

Theorem 5 was proved in [24]. Long before this paper (in 1987) V’yugin
established that the set S(x) can assume all possible shapes (within the obvious constraints) but 
only for α = o(|x|). Also, according to Levin El: “Kolmogorov told me about h_x(α) 
and asked how it could behave. I proved that h_x(α) + α + O(log α) is monotone but 
otherwise arbitrary within ±O(p log α) accuracy where p is the number of “jumps” 
of the arbitrary function imitated; it stabilizes on C(x) when α exceeds I(χ : x) [the 
information in the characteristic sequence χ of the “halting problem” about x]. The 
expression for accuracy was reworded by Kolmogorov to O(√(α log α)) [square root 
accuracy]; I gave it in the above, less elegant, but equivalent, terms. He gave a talk 
about these results at a meeting of Moscow Mathematical Society S3U.” This claim of 
Levin implies Theorem 5 that was published in [24]. 

Theorem 6 (mentioned in [22]) is easy and Theorem 9 easily follows from
Theorem 5. 

The existence of non-(α, β)-stochastic strings (for small α, β) was mentioned in [22]. 
Then V’yugin [29] and Muchnik [19] showed that their a priori measure is about 
2^{−α}, a direct corollary of which is our Theorem 10. 

Theorems 11 and 12 are essentially due to Levin (see [14] and [13]). 

Theorem 13 is easy to prove using A. Romashchenko’s “typization” trick (see [9] 
ED). 

Theorems 14 and 16 appeared in [25]; Theorem 15 appeared in [8]. 